Finalizing the Synthetic Data Generation
After completing the input selection and model configuration, users can fine-tune the synthetic dataset by choosing specific columns, applying filters, and reviewing a final summary before generating the data.
Steps to Finalize and Generate the Dataset
1. Final Summary
The Final Summary tab provides a complete overview of all configurations for generating the synthetic dataset. Here, users can verify the following sections:
- Model Configuration:
  - Dataset Name: The name you assigned to the dataset.
  - Modeling Type: The modeling approach used, such as LLM (Large Language Model).
  - Selected Model: The model chosen for data generation, e.g., GPT-4.
  - Custom Prompt: Any custom prompt that has been used to guide the data generation.
This final summary allows users to review model settings and generation parameters before proceeding.
- Data Configuration:
  - Input Source: Indicates whether the input was provided via a file or a database.
  - File Type: For file-based inputs, the format of the input file (JSON, Text, etc.).
  - Sample Size: The number of synthetic data points to be generated (e.g., 10 data points).
  - Temperature and Top P: Sampling parameters that control the creativity and variability of the generated data.
  - Imputed Columns: Any columns with missing data that will be imputed.
  - Rebalancing Options: Whether the dataset will be adjusted for fairness or other conditions.
Review these settings carefully: they determine the size, variability, and composition of the generated dataset.
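The review described above can be pictured as a consistency check over the summary fields. The following is a minimal sketch, not the tool's actual schema: the class name `GenerationSummary`, every field name, the sample values, and the numeric ranges for Temperature and Top P (0 to 2 and above 0 up to 1, typical of hosted LLM APIs) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationSummary:
    # Hypothetical field names that mirror the labels on the Final Summary tab.
    dataset_name: str
    modeling_type: str
    selected_model: str
    input_source: str              # assumed values: "file" or "database"
    sample_size: int
    temperature: float
    top_p: float
    custom_prompt: str = ""
    file_type: str = ""            # only meaningful for file-based inputs
    imputed_columns: list = field(default_factory=list)

    def problems(self) -> list:
        """Return a list of issues; an empty list means nothing looks wrong."""
        issues = []
        if not self.dataset_name.strip():
            issues.append("dataset name is empty")
        if self.input_source not in ("file", "database"):
            issues.append("input source must be 'file' or 'database'")
        if self.input_source == "file" and not self.file_type:
            issues.append("file type is required for file-based input")
        if self.sample_size <= 0:
            issues.append("sample size must be positive")
        # Assumed ranges; check the limits of the model you actually selected.
        if not 0.0 <= self.temperature <= 2.0:
            issues.append("temperature outside 0..2")
        if not 0.0 < self.top_p <= 1.0:
            issues.append("top_p outside (0, 1]")
        return issues


# Illustrative values only.
summary = GenerationSummary(
    dataset_name="demo_dataset",
    modeling_type="LLM",
    selected_model="GPT-4",
    input_source="file",
    file_type="JSON",
    sample_size=10,
    temperature=0.7,
    top_p=0.9,
)
print(summary.problems())  # prints [] when every check passes
```

Returning a list of issues rather than raising on the first one matches the purpose of a summary page: the user sees everything that needs attention in a single pass.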
2. Data Output Options
After reviewing the final summary, the user can choose the output method for the generated synthetic dataset:
- Download as File:
  - Export the dataset in formats like JSON, CSV, or any other preferred file type.
- Store in Database:
  - The generated dataset can also be stored directly into a connected database for further analysis or processing.
Either way, the generated data is ready to use: as a file for immediate work, or inside an existing database for further analysis or processing.
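To make the two output paths concrete, here is a small sketch of what each amounts to for a handful of records. It uses only the Python standard library and does not reflect how the tool performs the export internally. The sample records, the function names, the table name `synthetic_data`, and the use of SQLite as the "connected database" are all assumptions.

```python
import csv
import io
import json
import sqlite3

# Hypothetical generated records; real output depends on your configuration.
records = [
    {"id": 1, "label": "positive"},
    {"id": 2, "label": "negative"},
]


def to_json(rows):
    """Serialize the records as a JSON array (the file-download path)."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV with a header row (the file-download path)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_database(rows, connection, table="synthetic_data"):
    """Insert the records into a table and return the resulting row count."""
    # Table names cannot be bound as parameters, so the name is interpolated;
    # only pass trusted table names here.
    connection.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, label TEXT)"
    )
    connection.executemany(
        f"INSERT INTO {table} (id, label) VALUES (:id, :label)", rows
    )
    connection.commit()
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


print(to_json(records))
print(to_csv(records))
print(to_database(records, sqlite3.connect(":memory:")))
```

The in-memory SQLite connection keeps the sketch self-contained; with a real deployment the connection would point at whichever database the tool is configured to use.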
Example Output Configuration:
The following images show an example of the output configuration:
- Download Format: JSON
- Output Type: enum (customizable based on user preference)
- Generated Data Points: 10
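After a download in JSON format, a quick sanity check is to confirm that the file holds the number of data points shown in the configuration. The sketch below assumes the download is a JSON array with one element per data point, which this page does not confirm; the function name `check_download` and the stand-in data are hypothetical.

```python
import json


def check_download(json_text, expected_points):
    """Return True if the JSON array contains the expected number of records."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        # Assumption: one array element per generated data point.
        raise ValueError("expected a JSON array of records")
    return len(data) == expected_points


# Stand-in for the contents of a downloaded file with 10 data points.
sample = json.dumps([{"value": i} for i in range(10)])
print(check_download(sample, 10))
```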
After verifying the output configuration, users can click the Start Generation button to begin generating the synthetic data based on the specified settings.
Conclusion
The Final Summary page gives a detailed overview of all configuration settings before data generation begins. Together with the choice of input configuration and export format, it helps ensure that the synthetic dataset meets the project's specific requirements. Whether downloaded directly or stored in a database, the dataset generated by the selected model (e.g., GPT-4) is ready to be integrated into your AI workflows.
With this overview, users can proceed with synthetic data generation knowing exactly how their dataset is configured and where its output will go.